Using Very Large Parsed Corpora and Judgment Data to Classify Verb Reflexivity
Dutch has two reflexive pronouns, zich and zichzelf. When is each one used? This question has been debated in the literature on binding theory, reflexives and anaphora resolution. Partial solutions have attempted to use syntactic binding domains, semantic features and pragmatic concepts such as focus to predict reflexive choice, but until now no experimental data either for or against any of these theories have been available. In this paper we look at reflexive choice on the basis of empirical data: a large-scale corpus study and an online questionnaire. On the basis of the results of both studies, we are able to predict the choice between the two reflexive items in Dutch without assuming a priori a distinction between verbs that occur with zich and verbs that occur with zichzelf (cf. distinctions such as ‘inherent reflexivity’ (Reinhart and Reuland, 1993)). Instead, we examine the distribution of zich and zichzelf using the Clef corpus, a 70-million-word Very Large Corpus of Dutch. The corpus is tagged and parsed. This allows us to identify the typical action each verb is used to describe: a reflexive or a non-reflexive action. Regression analysis shows that, by doing so, 21% of the distribution of the two reflexive items in Dutch can be predicted. Using the verb reflexivity found in the corpus study even allows us to explain 83% of the participants’ choices between zich and zichzelf in the online study. As such, both the corpus study and the online questionnaire confirm the group of verbs called ‘inherent reflexive verbs’ without postulating the group beforehand. We further discovered that even inherently reflexive verbs, which are argued never to co-occur with zichzelf, sometimes had zichzelf chosen as the preferred argument in the questionnaire and, to a lesser degree, in the corpus, suggesting that the verb classes are tendential rather than categorical.

1 Two Reflexives, One Meaning?
Dutch, like German, French, Swedish and Danish, but unlike English, has two reflexive pronouns: zich and zichzelf, both unspecified for gender, number and case:

(1) Jan wast zich/zichzelf.
    Jan washes SE/SELF
    ‘Jan washes himself’

(2) Jan schaamt zich/*zichzelf.
    Jan shames SE/*SELF
    ‘Jan is ashamed of himself’

(1) can be used with both zich and zichzelf, while (2) seems to be possible only with zich. There has been much theoretical debate about which features predict the choice of zich or zichzelf. The choice has been argued to be the result of syntactic constraints (Broekhuis 2004, Reuland and Koster 1991), to be strongly affected by semantic properties of the verb (Haeseryn et al. 2002 (Algemene Nederlandse Spraakkunst, ANS), Reinhart and Reuland 1993, Lidz 2001), by the degree of affectiveness of the situation (Everaert 1986, Geurts 2004), or by the placement of focus (Everaert, 1986). However, as far as we know there are no large-scale corpus studies or questionnaire studies documenting the use of zich and zichzelf.

Such data are important for several reasons. First, heuristics for the types of objects a given verb tends to co-occur with can improve parsing. Second, the choice of reflexive zich with a non-reflexive verb is suggested to be related to the habitualness of the event in the context. Confirming this empirically would mean we have a new surface clue to habitual events, an interesting result for natural language understanding. Third, the acquisition of reflexives and pronouns is a major topic in child language. To construct materials and interpret results correctly for Dutch and other languages with two reflexives, we need to know what their uses are. Finally, the results should be relevant to the choice of the reflexive in natural language generation.

A. Branco (Ed.): DAARC 2007, LNAI 4410, pp. 77–93, 2007. © Springer-Verlag Berlin Heidelberg 2007. E.-J. Smits, P. Hendriks, and J. Spenader.
The purpose of this study is to see to what degree a large-scale corpus study and an online questionnaire can help predict the choice between zich and zichzelf. Through an analysis of the distribution of zich and zichzelf among predicate types, we also address the existence of the different classes of reflexivity found in the literature (under terms such as inherent reflexive verbs, necessarily reflexive verbs and accidental reflexive verbs). We do this by examining the use of each predicate and looking at how often the action denoted by the verb is performed reflexively in the corpus compared to how often it is performed on some other party. The experimental data show that this is only possible if both reflexive and non-reflexive transitive uses are taken into account, considering both corpus and questionnaire data.
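The comparison described above, between how often a verb's action is performed reflexively and how often it is performed on some other party, can be illustrated with a small sketch. The exact measure used in the study is not spelled out in this section, so the ratio below is an assumption, and the function names, the `(verb, object_type)` input format and the toy counts are all hypothetical and not taken from the Clef corpus.

```python
from collections import Counter


def reflexivity_scores(observations):
    """Per verb, the share of transitive uses whose object is a reflexive.

    `observations` is an iterable of (verb, object_type) pairs, where
    object_type is "zich", "zichzelf", or "other" (a non-reflexive object).
    This ratio is an assumed stand-in for the paper's reflexivity measure.
    """
    reflexive = Counter()
    total = Counter()
    for verb, object_type in observations:
        total[verb] += 1
        if object_type in ("zich", "zichzelf"):
            reflexive[verb] += 1
    return {verb: reflexive[verb] / total[verb] for verb in total}


def zich_share(observations):
    """Per verb, the share of reflexive uses realised as zich (not zichzelf)."""
    zich = Counter()
    reflexive = Counter()
    for verb, object_type in observations:
        if object_type in ("zich", "zichzelf"):
            reflexive[verb] += 1
            if object_type == "zich":
                zich[verb] += 1
    return {verb: zich[verb] / reflexive[verb] for verb in reflexive}


# Hypothetical toy counts, invented for illustration only.
toy = (
    [("schamen", "zich")] * 9
    + [("schamen", "zichzelf")] * 1
    + [("wassen", "zich")] * 3
    + [("wassen", "zichzelf")] * 2
    + [("wassen", "other")] * 5
)

scores = reflexivity_scores(toy)
shares = zich_share(toy)
for verb in sorted(scores):
    print(verb, scores[verb], shares[verb])
```

In this toy data, a verb that never takes a non-reflexive object gets a score of 1, while a verb used equally often with reflexive and non-reflexive objects gets 0.5. A score of this kind could then serve as a predictor in a regression on the choice between zich and zichzelf, which is the role verb reflexivity plays in the analysis summarised in the abstract.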
Similar resources
Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora
In this paper, we reported experiments of unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora on the basis of a limited number of search heuristics not relying on any previous lexico-syntactic knowledge about SCFs. Although preliminary, reported res...
Discarding Noise in an Automatically Acquired Lexicon of Support verb Constructions
We applied data-driven methods to carry out automatic acquisition of Dutch prepositional support verb constructions (SVCs) in corpora (e.g., iets in de gaten houden (“keep an eye on something”)). This paper addresses the question whether linguistic diagnostics help to discard noise from the n-best lists and how to (semi-)automatically apply such linguistic diagnostics to parsed corpora. We show t...
Parsed Corpora for Linguistics
Knowledge-based parsers are now accurate, fast and robust enough to be used to obtain syntactic annotations for very large corpora fully automatically. We argue that such parsed corpora are an interesting new resource for linguists. The argument is illustrated by means of a number of recent results which were established with the help of parsed corpora.
Development of an ESP E-learning Tool Using In-House Corpora
This study introduces a methodology for developing an e-learning tool by using corpora and computer software for linguistic analysis. The corpora compiled for this study are of journal articles from two engineering fields, and of articles from general science magazines. Each corpus, consisting of approximately 500,000 words, is tagged and parsed. Analysis of these corpora reveals that the past pa...
Using Parsed Corpora for Structural Disambiguation in the TRAINS Domain
This paper describes a prototype disambiguation module, KANKEI, which uses two corpora of the TRAINS project. In ambiguous verb phrases of form V NP PP or V NP adverb(s), the two corpora have very different PP and adverb attachment patterns: in the first, the correct attachment is to the VP of the time, while in the second, the correct attachment is to the NP of the time. KANKEI uses various n-gram patterns...